Machine Learning: Mathematical Theory and Applications
Computer lab 1
Hugo Xinghe Chen
Published
September 6, 2024
Warning
The instructions in this lab assume that the following packages are installed:
splines
caret
Packages may be installed by executing install.packages('packagename'), where 'packagename' is the name of the package, e.g. 'splines'. You may have to install additional packages to solve the computer lab. If you want this file to run locally on your computer, you have to change some paths to where the data are stored (see below).
Introduction
This computer lab treats topics such as polynomial and spline regression, regularisation, cross-validation, regression and decision trees, and classification methods.
Instructions
In this computer lab you will work individually, but you are free to discuss with other students, especially the student you will work with on Computer labs 2 and 4! However, you have to hand in your own set of solutions. It is strictly forbidden to copy each other's code or solutions, as is the use of AI tools (such as ChatGPT). Not obeying these instructions is regarded as academic misconduct and may have serious consequences for you.
This computer lab is worth a total of 15 marks (the subject has a maximum of 100 marks). The maximum number of marks for a given problem is shown in parentheses at the top of the section. The problems you should solve are marked with the symbol 💪 and surrounded by a box. Hand in the solutions and programming code in the form of an html document generated by Quarto. Before submitting, carefully check that the document compiles without any errors. Use properly formatted figures and programming code.
Warning
Not submitting according to the instructions may result in loss of marks. The same applies to poorly formatted reports (including poorly formatted code/figures).
1. Polynomial regression for bike rental data (2 marks)
In this problem, we consider modelling the number of rides each hour for a bike rental company. The raw data are stored in the file bike_rental_hourly.csv, which can be downloaded from the Canvas page of the course.
The dataset contains 17379 hourly observations over the two-year period January 1, 2011 - December 31, 2012. It contains many features that may be useful for predicting the number of rides, such as the hour of the day, temperature, humidity, season, weekday, etc. In this section, we consider modelling the logarithm of the number of rides as a function of the time of day (scaled to the interval [0,1], see below). The following code reads in the data (note that you have to change the path to where you stored the file!) and creates the variables of interest (log_cnt and hour).
bike_data <- read.csv("bike_rental_hourly.csv") # change the path to where you stored the file
bike_data$log_cnt <- log(bike_data$cnt)
bike_data$hour <- bike_data$hr/23 # transform [0, 23] to [0, 1]. 0 is midnight, 1 is 11 PM
plot(log_cnt ~ hour, data = bike_data, col = "cornflowerblue")
The number of rides seems to peak in the morning (8 AM, \(8/23 \approx 0.35\)) and after work (6 PM, \(18/23 \approx 0.78\)).
We start by fitting a polynomial of order 2 to a training dataset consisting of Feb 1, 2011 - March 31, 2011. We use the following two months (April 1, 2011 - May 31, 2011) as the test dataset to evaluate the predictions. The following code creates the training and test datasets.
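The splitting chunk is not reproduced above. A minimal sketch of one way to build it, assuming the dates are stored in a column named dteday (as in the original bike sharing dataset); the helper name split_by_date is ours:

```r
# Hypothetical helper: keep the rows of a data frame whose dteday falls in [from, to]
split_by_date <- function(data, from, to) {
  d <- as.Date(data$dteday)
  data[d >= as.Date(from) & d <= as.Date(to), , drop = FALSE]
}

# Training: Feb 1, 2011 - March 31, 2011; test: April 1, 2011 - May 31, 2011
# bike_data_train <- split_by_date(bike_data, "2011-02-01", "2011-03-31")
# bike_data_test  <- split_by_date(bike_data, "2011-04-01", "2011-05-31")
```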
The following code estimates the model (a polynomial of order 2) using least squares and computes the root mean squared error (RMSE) for both the training data and the test data. Moreover, the code plots the training data, test data and the fitted polynomial in the same plot.
y_train <- bike_data_train$log_cnt
y_test <- bike_data_test$log_cnt
p <- 2 # Order of polynomial
X_train <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE)) # Design matrix / matrix of features (including intercept)
beta_hat <- solve(t(X_train) %*% X_train) %*% t(X_train) %*% y_train

# Predict in-sample and compute RMSE
y_hat_train <- X_train %*% beta_hat
RMSE_train <- sqrt(sum((y_train - y_hat_train)^2)/length(y_train))

# Predict out-of-sample and compute RMSE
X_test <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))
y_hat_test <- X_test %*% beta_hat
RMSE_test <- sqrt(sum((y_test - y_hat_test)^2)/length(y_test))

# Plot training data, test data, and fit on a fine grid.
plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")
hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- cbind(1, poly(hours_grid, p, raw = TRUE, simple = TRUE))
y_hat_grid <- X_grid %*% beta_hat
lines(hours_grid, y_hat_grid, lty = 1, col = "lightcoral")
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1),
       col = c("cornflowerblue", "lightcoral", "lightcoral"),
       legend = c("Train", "Test", "Fitted curve"))
💪 Problem 1.1
Fit a polynomial of order 8 using least squares, without using R functions such as lm(). Plot the training data, test data and the fitted polynomial in the same plot.
# Generate polynomial features up to order 8 for training data
p <- 8
X_train1 <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE))
y_train1 <- bike_data_train$log_cnt

# Calculate the coefficients of the polynomial using least squares
beta_hat1 <- solve(t(X_train1) %*% X_train1) %*% t(X_train1) %*% y_train1

X_test1 <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))
y_test1 <- bike_data_test$log_cnt

# Predictions
y_hat_train1 <- X_train1 %*% beta_hat1
y_hat_test1 <- X_test1 %*% beta_hat1
RMSE_train1 <- sqrt(sum((y_train1 - y_hat_train1)^2) / length(y_train1))
RMSE_test1 <- sqrt(sum((y_test1 - y_hat_test1)^2) / length(y_test1))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
points(bike_data_test$hour, bike_data_test$log_cnt, col = "lightcoral")
hours_grid1 <- seq(0, 1, length.out = 1000)
X_grid1 <- cbind(1, poly(hours_grid1, p, raw = TRUE, simple = TRUE))
y_hat_grid1 <- X_grid1 %*% beta_hat1
lines(hours_grid1, y_hat_grid1, lty = 1, col = "green", lwd = 2)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1),
       col = c("cornflowerblue", "lightcoral", "green"), # legend colour matches the green curve
       legend = c("Train", "Test", "Fitted curve"))
💪 Problem 1.2
Fit polynomials of varying order 1-10 using least squares, without using R functions such as lm(). Compute the RMSEs for the training and test data and plot them on the same figure as a function of the polynomial order. Are you underfitting or overfitting the data?
RMSE_train <- numeric(10)
RMSE_test <- numeric(10)
y_train <- bike_data_train$log_cnt
y_test <- bike_data_test$log_cnt

# Polynomial regression for orders 1 to 10
for (p in 1:10) {
  X_train <- cbind(1, poly(bike_data_train$hour, p, raw = TRUE, simple = TRUE))
  beta_hat <- solve(t(X_train) %*% X_train) %*% t(X_train) %*% y_train
  X_test <- cbind(1, poly(bike_data_test$hour, p, raw = TRUE, simple = TRUE))
  y_hat_train <- X_train %*% beta_hat
  y_hat_test <- X_test %*% beta_hat
  RMSE_train[p] <- sqrt(sum((y_train - y_hat_train)^2) / length(y_train))
  RMSE_test[p] <- sqrt(sum((y_test - y_hat_test)^2) / length(y_test))
}

plot(1:10, RMSE_train, type = "b", col = "blue", pch = 16,
     xlab = "Polynomial Order", ylab = "RMSE", main = "RMSE vs. Polynomial Order",
     ylim = c(0, max(c(RMSE_train, RMSE_test))))
points(1:10, RMSE_test, type = "b", col = "red", pch = 16)
legend("topright", legend = c("Training Data", "Test Data"),
       col = c("blue", "red"), pch = 16, bg = "white")
We can see a significant gap between the training and test RMSE values, so the model is overfitting the data.
💪 Problem 1.3
Polynomials are global functions and their fit may be sensitive to outliers. Local fitting methods can sometimes be more robust. One such method is the loess method, which is a nonparametric method that fits locally weighted regressions to subsets of points and subsequently combines them to obtain a global fit. Use the loess function in R with the standard settings to fit a locally weighted regression to the training data. Is this method better than that in Problem 1.1? Plot the training data, test data and both fitted curves in the same plot.
Tip
The predict() function can be used on the object returned by loess().
# Fit regression using loess
loess_fit <- loess(log_cnt ~ hour, data = bike_data_train)
y_hat_train_loess <- predict(loess_fit, newdata = bike_data_train)
y_hat_test_loess <- predict(loess_fit, newdata = bike_data_test)
RMSE_train_loess <- sqrt(mean((y_train1 - y_hat_train_loess)^2))
RMSE_test_loess <- sqrt(mean((y_test1 - y_hat_test_loess)^2))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
points(bike_data_test$hour, bike_data_test$log_cnt, col = "lightcoral")
# Problem 1.1 polynomial fit
lines(hours_grid1, y_hat_grid1, lty = 1, col = "green", lwd = 2)
# Loess fit: sort the hours so the curve is drawn left to right
ord <- order(bike_data_train$hour)
lines(bike_data_train$hour[ord], y_hat_train_loess[ord], col = "purple", lwd = 2, lty = 2)
legend(x = "topleft", pch = c(1, 1, NA, NA), lty = c(NA, NA, 1, 2),
       col = c("cornflowerblue", "lightcoral", "green", "purple"),
       legend = c("Train", "Test", "Polynomial Fit", "Loess Fit"))
cat("RMSE for Polynomial Regression (Train): ", RMSE_train1, "\n")
RMSE for Polynomial Regression (Train): 0.7232024
cat("RMSE for Polynomial Regression (Test): ", RMSE_test1, "\n")
RMSE for Polynomial Regression (Test): 1.021449
cat("RMSE for Loess Fit (Train): ", RMSE_train_loess, "\n")
RMSE for Loess Fit (Train): 0.8989078
cat("RMSE for Loess Fit (Test): ", RMSE_test_loess, "\n")
RMSE for Loess Fit (Test): 1.120946
Since the RMSEs of the loess fit are higher than those of the polynomial regression, the loess method is not better than the method in Problem 1.1.
2. Regularised spline regression for bike rental data (3 marks)
Using the same training and test datasets as above, we will fit spline regressions with L1 and L2 regularisations (also known as penalties).
We create a natural cubic spline basis function for the variable hour with 25 equally spaced knots between 0.05 and 0.95. We first fit a regression using least squares without regularisation. The following code does this and plots the spline fit together with the training and test data. Note that we are not adding an intercept to the design matrix, since this is taken care of by the argument intercept=TRUE.
suppressMessages(library(splines))
knots <- seq(0.05, 0.95, length.out = 25)
X_train <- ns(bike_data_train$hour, knots = knots, intercept = TRUE)
X_test <- ns(bike_data_test$hour, knots = knots, intercept = TRUE)
beta_hat <- solve(t(X_train) %*% X_train) %*% t(X_train) %*% y_train

# Plot training data, test data, and spline fit on a fine grid.
plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")
hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- ns(hours_grid, knots = knots, intercept = TRUE) # cbind(1, ns(hours_grid, knots = knots))
y_hat_spline_grid <- X_grid %*% beta_hat
lines(hours_grid, y_hat_spline_grid, lty = 1, col = "green", lwd = 1.5)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1),
       col = c("cornflowerblue", "lightcoral", "green"), # legend colour matches the green curve
       legend = c("Train", "Test", "Spline"))
We have deliberately chosen too many knots and no regularisation so that the fit has a large variance, which is clearly evident from the figure above. Your task in the next problem is to fit a regularised spline regression.
💪 Problem 2.1
Fit a spline regression to the training data with an L2 regularisation using a suitable value of \(\lambda\), without using R functions such as glmnet(). Plot the fit together with the training and test data.
Tip
The least squares estimator when using an L2 penalty is known as the Ridge regression estimator. Moreover, to determine a suitable value of \(\lambda\), one approach is to fit several models (using different \(\lambda\) values, e.g. lambda_grid <- seq(0, 1, length.out=100)) and choose the value of \(\lambda\) that minimises the root mean squared error of the test data.
# cat("Best lambda value: ", best_lambda, "\n")
# cat("RMSE for Test data with best lambda: ", min_rmse, "\n")
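The chunk above only shows commented-out print statements; the fit itself is omitted. A sketch of the intended approach, following the tip (ridge estimates over a grid of \(\lambda\) values, keeping the one with the smallest test RMSE). It assumes the spline design matrices X_train/X_test and responses y_train/y_test from the chunk above; the helper names are ours, and note that this simple penalty also shrinks the intercept column.

```r
# Ridge estimator: beta_hat = (X'X + lambda*I)^(-1) X'y
ridge_fit <- function(X, y, lambda) {
  solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y))
}

# Search a grid of lambda values and keep the one with the smallest test RMSE
ridge_select <- function(X_train, y_train, X_test, y_test,
                         lambda_grid = seq(0, 1, length.out = 100)) {
  rmse <- sapply(lambda_grid, function(lam) {
    beta <- ridge_fit(X_train, y_train, lam)
    sqrt(mean((y_test - X_test %*% beta)^2))
  })
  list(best_lambda = lambda_grid[which.min(rmse)], min_rmse = min(rmse), rmse = rmse)
}
```

With the objects from the document one would call sel <- ridge_select(X_train, y_train, X_test, y_test); the commented cat() lines above would then report sel$best_lambda and sel$min_rmse, and the fit at the chosen \(\lambda\) can be plotted exactly as in the unregularised case.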
💪 Problem 2.2
Use the package glmnet to fit a spline regression with an L2 regularisation using the same basis functions as the previous problem. Find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data. Plot the fit together with the training and test data.
Tip
The help() function will be very useful here. Another useful function is cv.glmnet(), where cv stands for cross-validation. The argument alpha (to the cv.glmnet() function) is key. Finally, the predict() function can be used on the object returned by cv.glmnet() and the argument s (to the predict() function) controls which \(\lambda\) to use.
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
# Set alpha = 0 for L2 (ridge) regularisation
alpha <- 0
lambda_seq <- 10^seq(0, -2, by = -0.1)

# Cross-validation
cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)
optimal_lambda <- cv_fit$lambda.1se # one-standard-deviation rule

# Fit the final model
final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)
y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)
rmse_train <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test <- sqrt(mean((y_test2 - y_hat_test)^2))

plot(log_cnt ~ hour, data = bike_data_train, col = "cornflowerblue", ylim = c(0, 8))
lines(bike_data_test$hour, bike_data_test$log_cnt, type = "p", col = "lightcoral")
hours_grid <- seq(0, 1, length.out = 1000)
X_grid <- ns(hours_grid, knots = knots2, intercept = TRUE)
y_hat_spline_grid <- predict(final_model, s = optimal_lambda, newx = X_grid)
lines(hours_grid, y_hat_spline_grid, lty = 1, col = "green", lwd = 2)
legend(x = "topleft", pch = c(1, 1, NA), lty = c(NA, NA, 1),
       col = c("cornflowerblue", "lightcoral", "green"),
       legend = c("Train", "Test", "Regularized Spline"))
# Print the optimal lambda and RMSE for training and test data
cat("Optimal lambda using the one-standard deviation: ", optimal_lambda, "\n")
Optimal lambda using the one-standard deviation: 0.4379179
cat("RMSE for Training data with optimal lambda: ", rmse_train, "\n")
RMSE for Training data with optimal lambda: 0.7112003
cat("RMSE for Test data with optimal lambda: ", rmse_test, "\n")
RMSE for Test data with optimal lambda: 1.00077
💪 Problem 2.3
Repeat Problem 2.2, however, use the optimal \(\lambda\) that minimises the mean cross-validated error. Compare the RMSE for the training and test data to those in Problem 2.2.
Tip
The help() function will be very useful here.
alpha <- 0
lambda_seq <- 10^seq(0, -2, by = -0.1)
cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)

# Lambda that minimises the mean cross-validated error
optimal_lambda <- cv_fit$lambda.min

final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)
y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)
rmse_train_optimal <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test_optimal <- sqrt(mean((y_test2 - y_hat_test)^2))
cat("RMSE for Training data using mean cross-validation error (2.3): ", rmse_train_optimal, "\n")
RMSE for Training data using mean cross-validation error (2.3): 0.6913383
cat("RMSE for Test data using mean cross-validation error (2.3): ", rmse_test_optimal, "\n")
RMSE for Test data using mean cross-validation error (2.3): 1.004461
cat("RMSE for Training data using one-standard deviation (2.2): ", rmse_train, "\n")
RMSE for Training data using one-standard deviation (2.2): 0.7112003
cat("RMSE for Test data using one-standard deviation (2.2): ", rmse_test, "\n")
RMSE for Test data using one-standard deviation (2.2): 1.00077
The two choices give similar RMSEs. The one using the mean cross-validated error performs better on the training data and slightly worse on the test data than the one using the one-standard-deviation rule.
💪 Problem 2.4
Repeat Problem 2.2 using L1 regularisation. Compare the RMSE for the training and test data to those in Problem 2.2.
Tip
The help() function will be very useful here to set the argument alpha (to the cv.glmnet() function).
# Set alpha = 1 for L1 (lasso) regularisation
alpha <- 1
lambda_seq <- 10^seq(0, -2, by = -0.1)
cv_fit <- cv.glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = lambda_seq, nfolds = 10)

# As in 2.2, use the one-standard-deviation rule
optimal_lambda <- cv_fit$lambda.1se

final_model <- glmnet(x = X_train2, y = y_train2, alpha = alpha, lambda = optimal_lambda)
y_hat_train <- predict(final_model, s = optimal_lambda, newx = X_train2)
y_hat_test <- predict(final_model, s = optimal_lambda, newx = X_test2)
rmse_train_lasso <- sqrt(mean((y_train2 - y_hat_train)^2))
rmse_test_lasso <- sqrt(mean((y_test2 - y_hat_test)^2))
cat("RMSE for Training data with L1 regularization: ", rmse_train_lasso, "\n")
RMSE for Training data with L1 regularization: 0.7144238
cat("RMSE for Test data with L1 regularization: ", rmse_test_lasso, "\n")
RMSE for Test data with L1 regularization: 1.00522
cat("RMSE for Training data in Problem 2.2: ", rmse_train, "\n")
RMSE for Training data in Problem 2.2: 0.7112003
cat("RMSE for Test data in Problem 2.2: ", rmse_test, "\n")
RMSE for Test data in Problem 2.2: 1.00077
We can see that L2 regularisation (2.2) performs slightly better than L1 (lasso) regularisation on both the training and test data.
3. Regularised regression for bike rental data with more features and data (2 marks)
We now consider the full dataset and use many more features. We use the first one and a half years (Jan 1, 2011 - May 31, 2012) of data for training and the last half year (June 1, 2012 - Dec 31, 2012) for testing.
weekday: Sun (0), Mon (1), Tue (2), Wed (3), Thu (4), Fri (5), Sat (6).
season: Winter (1), spring (2), summer (3), fall (4).
We will use the common approach in statistics of creating \(K-1\) dummy variables for a categorical variable with \(K\) levels. The first level is the reference category. The following code constructs these so-called one-hot encodings and adds them to the dataset bike_data.
# One-hot encoding for weathersit
one_hot_encode_weathersit <- model.matrix(~ as.factor(weathersit) - 1, data = bike_data)
one_hot_encode_weathersit <- one_hot_encode_weathersit[, -1] # Remove reference category
colnames(one_hot_encode_weathersit) <- c('cloudy', 'light rain', 'heavy rain')
bike_data <- cbind(bike_data, one_hot_encode_weathersit)

# One-hot encoding for weekday
one_hot_encode_weekday <- model.matrix(~ as.factor(weekday) - 1, data = bike_data)
one_hot_encode_weekday <- one_hot_encode_weekday[, -1] # Remove reference category
colnames(one_hot_encode_weekday) <- c('Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
bike_data <- cbind(bike_data, one_hot_encode_weekday)

# One-hot encoding for season
one_hot_encode_season <- model.matrix(~ as.factor(season) - 1, data = bike_data)
one_hot_encode_season <- one_hot_encode_season[, -1] # Remove reference category
colnames(one_hot_encode_season) <- c('Spring', 'Summer', 'Fall')
bike_data <- cbind(bike_data, one_hot_encode_season)
💪 Problem 3.1
Fit a spline regression to the training data with an L1 regularisation. Find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data. For the spline, use the splines package in R to create natural cubic spline basis functions of the variable hour with 10 degrees of freedom, i.e. use the ns() function with df=10 as input argument. Moreover, set intercept=FALSE and add the intercept manually to your design matrix instead.
The following variables should be included in the regression:
Response variable: log_cnt.
Features: hour (via the cubic spline described above), yr, holiday, workingday, temp, atemp, hum, windspeed, and all the dummy variables created above.
Tip
First create the design matrix of interest (you have to exclude the original variables you did the one-hot coding for). When creating the spline basis for the test data, you will need to use the same knots as when constructing the basis functions for the training dataset. If the spline is stored in a variable spline_basis, you can extract the knots by using knots<-attr(spline_basis, "knots") and then use the knots argument when constructing the design matrix for the test data, e.g. ns(bike_all_data_test$hour, df=10, knots=knots, intercept = FALSE).
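The design-matrix construction is not shown in the solution below (the residual-analysis chunk assumes design_matrix_train, design_matrix_test, response_train and response_test already exist). A sketch of how they could be built, following the tip; the helper name make_design and the exact feature list are our reconstruction from the problem statement:

```r
library(splines)

# Hypothetical helper: build the design matrix for a data frame that already
# contains log_cnt, the one-hot dummies and the other features listed above.
# Pass the training knots when building the test matrix.
make_design <- function(data, df = 10, knots = NULL) {
  spline_basis <- if (is.null(knots)) {
    ns(data$hour, df = df, intercept = FALSE)
  } else {
    ns(data$hour, df = df, knots = knots, intercept = FALSE)
  }
  features <- c("yr", "holiday", "workingday", "temp", "atemp", "hum", "windspeed",
                "cloudy", "light rain", "heavy rain",
                "Mon", "Tue", "Wed", "Thu", "Fri", "Sat",
                "Spring", "Summer", "Fall")
  X <- cbind(1, spline_basis, as.matrix(data[, features])) # intercept added manually
  list(X = X, knots = attr(spline_basis, "knots"))
}
```

The fit then follows the pattern of Problem 2.2 with alpha = 1: cv_fit <- cv.glmnet(x = design_matrix_train, y = response_train, alpha = 1, nfolds = 10) and lambda_optimal <- cv_fit$lambda.1se.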
Which three features in Problem 3.1 seem to be the most important?
Tip
To explore which variables are most important, you can re-estimate the model in Problem 3.1 without cross-validation (using glmnet()). Note that you can apply the plot() function to an object returned by glmnet() using the arguments xvar="lambda" or xvar="norm", and label = TRUE, which allows you to visualise the lasso path.
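A concrete illustration of the tip, assuming the design matrix and response from the main fit are stored in design_matrix_train and response_train (as in the residual-analysis chunk):

```r
library(glmnet)
# Re-estimate the model over the whole lambda path (no cross-validation)
lasso_path <- glmnet(x = design_matrix_train, y = response_train, alpha = 1)
# Coefficient paths: the variables whose coefficients stay non-zero longest
# as lambda grows are the most important ones
plot(lasso_path, xvar = "lambda", label = TRUE)
```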
Carry out a residual analysis in Problem 3.1 for the training data. What can you say about the assumption of independent residuals? Repeat the same for the test data.
Tip
Plot the autocorrelation function of the residuals to validate the independence assumption.
model_full <- glmnet(x = design_matrix_train, y = response_train, alpha = 1, lambda = lambda_optimal)
pred_train <- predict(model_full, newx = design_matrix_train, s = lambda_optimal)

# Calculate residuals
residuals_train <- response_train - pred_train

# Flatten residuals to a vector
residuals_train <- as.vector(residuals_train)
acf(residuals_train, main = "ACF of Residuals (Training Data)")
The residuals show significant autocorrelation at multiple lags, with a clear cyclical pattern, so they are not independent. The assumption of independent residuals is therefore violated for the training data.
pred_test <- predict(model_full, newx = design_matrix_test, s = lambda_optimal)
residuals_test <- response_test - pred_test
residuals_test <- as.vector(residuals_test)
acf(residuals_test, main = "ACF of Residuals (Test Data)")
The same holds for the test data: the residuals show significant autocorrelation at multiple lags with a cyclical pattern, so the assumption of independent residuals is violated for the test data as well.
4. Regularised time series regression for bike rental data (2 marks)
In this section we will consider the time series nature of the data.
💪 Problem 4.1
Plot a time series plot of the response in the original scale (i.e. counts and not log-counts) for the last week of the test data (last \(24\times 7\) observations). In the same figure, plot a time series plot of the fitted values (in the original scale) from Problem 3.1. Comment on the fit.
last_week_indices <- (nrow(bike_data_test) - 24*7 + 1):nrow(bike_data_test)
last_week_actual <- bike_data_test$cnt[last_week_indices]
last_week_fitted <- exp(pred_test[last_week_indices]) # back to the original count scale
plot(last_week_actual, type = "l", col = "blue", lwd = 2, ylab = "Counts", xlab = "Time",
     main = "Actual vs Fitted Counts for the Last Week")
lines(last_week_fitted, col = "red", lwd = 2, lty = 2)
legend("topright", legend = c("Actual Counts", "Fitted Counts"),
       col = c("blue", "red"), lwd = 2, lty = c(1, 2))
According to the output, the fitted values follow roughly the same structure as the actual counts, but the fit is clearly not accurate.
💪 Problem 4.2
Add time series effects to your model by including some lagged values of the response as features. Use the first four hourly lags of log_cnt plus the 24th hourly lag as features, in addition to all other features in Problem 3.1. Fit the model using an L1 regularisation and find the optimal \(\lambda\) by applying the one-standard deviation rule when cross-validating the training data using 10 folds. Compute the RMSE (using the optimal \(\lambda\)) for the training and test data and compare to Problem 3.1. Are the residuals from this new model more adequate?
Tip
The function mutate() from the dplyr package is useful for adding columns to a data frame. Moreover, the function lag(x, k) lags the time series x by k steps.
Notice that some observations are lost when lagging, i.e. NA values are introduced. You can remove these observations from the dataset.
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
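The chunk creating the lagged features is not shown. A sketch of the lagging step, following the tip, using dplyr; the column names lag1-lag4 and lag24 match those appearing in the tree summary of Problem 5.1, but the helper name add_lags is ours:

```r
suppressMessages(library(dplyr))

# Add the first four hourly lags of log_cnt plus the 24th hourly lag,
# then drop the rows where lagging introduced NAs
add_lags <- function(data) {
  data %>%
    mutate(lag1  = lag(log_cnt, 1),
           lag2  = lag(log_cnt, 2),
           lag3  = lag(log_cnt, 3),
           lag4  = lag(log_cnt, 4),
           lag24 = lag(log_cnt, 24)) %>%
    na.omit()
}
```

The lagged columns are then appended to the design matrices of Problem 3.1 and the model is refitted with cv.glmnet(..., alpha = 1) using the one-standard-deviation rule, exactly as before.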
According to the plots, the predictions of 4.2 (green) are more accurate than those of 4.1 (red), so adding lags of the response variable improves the predictions.
5. Regression trees for bike rental data (2 marks)
Another alternative to the semi-parametric approach above is to fit a regression tree.
💪 Problem 5.1
Using the training dataset from Problem 4.2 (that also includes lagged values of the response as features), fit a regression tree using the tree package. Experiment with the settings to see how changing them affects the results.
library(tree)

# Combine X_train and y_train into a data frame
train_data <- data.frame(X_train, log_cnt = y_train)

# Fit the regression tree model
tree_model <- tree(log_cnt ~ ., data = train_data)
summary(tree_model)
Regression tree:
tree(formula = log_cnt ~ ., data = train_data)
Variables actually used in tree construction:
[1] "lag1" "X10" "X8" "lag4" "lag24"
Number of terminal nodes: 9
Residual mean deviance: 0.3258 = 3991 / 12250
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.51400 -0.34300 0.03172 0.00000 0.38000 2.33700
if (!require(rpart.plot)) install.packages("rpart.plot")
Loading required package: rpart.plot
library(rpart)
library(rpart.plot)

# Fit the regression tree with rpart and plot it
rpart_fit <- rpart(log_cnt ~ ., data = bike_data_train, method = "anova")
rpart.plot(rpart_fit, main = "Regression Tree for Bike Rental Data")
💪 Problem 5.3
Add the predictions from Problem 5.1 to the figure you created in Problem 4.3. Comment on the fit of the regression tree compared to that of the semi-parametric spline approach.
last_week_indices <- 1:(24*7)
counts_tree <- exp(pred_test_tree[last_week_indices])
time_points <- 1:(24*7)
plot(time_points, counts, type = "l", col = "blue", lty = 1, lwd = 2,
     ylab = "Counts", xlab = "Time")
lines(time_points, counts_4_1, col = "red", lty = 1, lwd = 2)
lines(time_points, counts_4_2, col = "green", lty = 1, lwd = 2)
lines(time_points, counts_tree, col = "purple", lty = 1, lwd = 2)
legend("topleft", legend = c("Actual Counts", "4.1 Counts", "4.2 Counts", "Tree Counts"),
       col = c("blue", "red", "green", "purple"), lty = c(1, 1, 1, 1), lwd = 2)
We can see that the predictions of the 5.1 tree model are not precise, which is characteristic of regression trees: the fitted values are piecewise constant. Still, the tree is clearly sensitive to the peaks and valleys.
6. Logistic regression for classifying spam emails (2 marks)
The dataset spam_ham_emails.RData consists of \(n=4601\) spam (\(y=1\)) and ham (no spam, \(y=0\)) emails with 15 corresponding features.
Most of the features are continuous real variables in the interval \([0, 100]\), with values corresponding to the percentage of occurrence of a specific word or character in the email. There are also a few features that capture the tendency to use many capital letters. The features are on different scales, so we will standardise them to have zero mean and unit variance. The following code reads in the data (note that you have to change the path to where you stored the file!), standardises the features, and codes the response variable as a factor.
'data.frame': 4601 obs. of 16 variables:
$ spam : Factor w/ 2 levels "not spam","spam": 1 2 1 2 2 2 2 1 2 1 ...
$ over : num -0.35 1.658 -0.35 0.636 0.344 ...
$ remove : num -0.292 -0.292 -0.292 -0.292 0.194 ...
$ internet : num -0.263 0.411 -0.263 -0.263 -0.263 ...
$ free : num -0.3013 1.0307 -0.3013 -0.3013 -0.0713 ...
$ hpl : num 4.267 -0.299 2.666 -0.299 -0.299 ...
$ X. : num -0.33 0.384 -0.33 -0.285 -0.07 ...
$ X..1 : num -0.308 0.875 -0.308 -0.308 0.924 ...
$ CapRunMax : num -0.1702 -0.0881 -0.2267 -0.0984 0.4096 ...
$ CapRunTotal: num 0.021 0.0127 -0.4062 -0.1242 0.7565 ...
$ our : num -0.464 -0.464 -0.464 -0.464 -0.182 ...
$ hp : num 3.979 -0.329 -0.329 -0.329 -0.329 ...
$ george : num -0.228 -0.228 -0.228 -0.228 -0.228 ...
$ X1999 : num 4.99 -0.323 2.77 -0.323 -0.323 ...
$ re : num 0.5919 -0.0309 -0.2977 -0.2977 -0.2977 ...
$ edu : num -0.197 -0.197 -0.197 -0.197 -0.197 ...
The response is the first column spam and we will use all the other features to predict if an email is a spam or ham. Let us first inspect the fraction of spam emails in the dataset.
cat("Percentage of spam:", 100*mean(Spam_ham_emails$spam =="spam"))
Percentage of spam: 39.40448
We now split the data using the createDataPartition() function in the caret package. The default method in caret is stratified sampling, which keeps the same fraction of spam and ham in both the training and test datasets. We use 75% of the data for training and 25% for testing.
set.seed(1234)
suppressMessages(library(caret))
train_obs <- createDataPartition(y = Spam_ham_emails$spam, p = .75, list = FALSE)
train <- Spam_ham_emails[train_obs, ]
test <- Spam_ham_emails[-train_obs, ]

# Confirm both training and test are balanced with respect to spam emails
cat("Percentage of training data consisting of spam emails:", 100*mean(train$spam == "spam"))
Percentage of training data consisting of spam emails: 39.40887
cat("Percentage of test data consisting of spam emails:", 100*mean(test$spam =="spam"))
Percentage of test data consisting of spam emails: 39.3913
The following code fits a logistic regression on the training data using the glm() function, predicts the test data following the rule if \(\mathrm{Pr}(y=1|\mathbf{x})>0.5 \Rightarrow y=1\), and computes the confusion matrix using the caret package.
glm_fit <-glm(spam ~ ., family = binomial, data = train)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
y_prob_hat_test <- predict(glm_fit, newdata = test, type = "response")
threshold <- 0.5 # Predict spam if probability > threshold
y_hat_test <- as.factor(y_prob_hat_test > threshold)
levels(y_hat_test) <- c("not spam", "spam")
confusionMatrix(data = y_hat_test, test$spam, positive = "spam")
Confusion Matrix and Statistics
Reference
Prediction not spam spam
not spam 658 65
spam 39 388
Accuracy : 0.9096
95% CI : (0.8915, 0.9255)
No Information Rate : 0.6061
P-Value [Acc > NIR] : < 2e-16
Kappa : 0.8087
Mcnemar's Test P-Value : 0.01423
Sensitivity : 0.8565
Specificity : 0.9440
Pos Pred Value : 0.9087
Neg Pred Value : 0.9101
Prevalence : 0.3939
Detection Rate : 0.3374
Detection Prevalence : 0.3713
Balanced Accuracy : 0.9003
'Positive' Class : spam
💪 Problem 6.1
Reconstruct the confusion matrix for the test data without using the confusionMatrix() function.
# Predictions
y_prob_hat_test <- predict(glm_fit, newdata = test, type = "response")
threshold <- 0.5
y_hat_test <- as.factor(y_prob_hat_test > threshold)
levels(y_hat_test) <- c("not spam", "spam")

true_labels <- test$spam
# True Positives: Predicted spam and it's actually spam
TP <- sum(y_hat_test == "spam" & true_labels == "spam")
# False Positives: Predicted spam but it's actually not spam
FP <- sum(y_hat_test == "spam" & true_labels == "not spam")
# True Negatives: Predicted not spam and it's actually not spam
TN <- sum(y_hat_test == "not spam" & true_labels == "not spam")
# False Negatives: Predicted not spam but it's actually spam
FN <- sum(y_hat_test == "not spam" & true_labels == "spam")

# Create the confusion matrix
confusion_matrix <- matrix(c(TN, FP, FN, TP), nrow = 2, byrow = TRUE,
                           dimnames = list("Actual" = c("not spam", "spam"),
                                           "Predicted" = c("not spam", "spam")))
print(confusion_matrix)
Predicted
Actual not spam spam
not spam 658 39
spam 65 388
💪 Problem 6.2
Compute the accuracy, precision, sensitivity (recall), and specificity without using the confusionMatrix() function. Explain these four concepts in the context of the spam filter.
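The four metrics follow directly from the confusion-matrix counts of Problem 6.1. In the spam-filter context: accuracy is the share of all emails classified correctly; precision is the share of predicted spam that really is spam; sensitivity (recall) is the share of actual spam that is caught; specificity is the share of ham that is correctly let through. A minimal sketch (the helper name is ours; the counts are those in the confusion matrix printed above):

```r
# Metrics from the confusion-matrix counts (TP, FP, TN, FN as in Problem 6.1)
classification_metrics <- function(TP, FP, TN, FN) {
  c(accuracy    = (TP + TN) / (TP + FP + TN + FN), # emails classified correctly
    precision   = TP / (TP + FP),                  # predicted spam that is spam
    sensitivity = TP / (TP + FN),                  # actual spam that is caught
    specificity = TN / (TN + FP))                  # ham correctly let through
}

# Counts from the confusion matrix above
round(classification_metrics(TP = 388, FP = 39, TN = 658, FN = 65), 4)
```

These values agree with the confusionMatrix() output shown earlier (accuracy 0.9096, precision 0.9087, sensitivity 0.8565, specificity 0.9440).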
# auc() is from the pROC package; roc_curve was computed in a chunk not shown
auc_value <- auc(roc_curve)
cat("Area Under the Curve (AUC):", auc_value)
Area Under the Curve (AUC): 0.9650901
The ROC curve shows that the spam classifier performs well. The curve bows towards the top-left corner, meaning the sensitivity is high (the filter catches most spam emails) while the false positive rate stays low (few legitimate emails are incorrectly classified as spam).
The fact that the curve sits well above the diagonal line, which corresponds to random guessing, confirms how effective this classifier is.
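For completeness, the `roc_curve` object used above can be constructed with the pROC package; a sketch, assuming `y_prob_hat_test` and `test` from Problem 6.1 (the `levels` argument names the negative class first):

```r
library(pROC)

# ROC curve from predicted spam probabilities vs. true labels
roc_curve <- roc(response = test$spam, predictor = y_prob_hat_test,
                 levels = c("not spam", "spam"))
plot(roc_curve, main = "ROC curve for the logistic spam classifier")
auc(roc_curve)
```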
7. Decision trees for classifying spam emails (1 mark)
💪 Problem 7.1
Using the training dataset from Problem 6, fit a decision tree (classification tree) using the tree package with the default settings. How does this classifier perform on the test dataset compared to the logistic classifier in Problem 6?
```r
library(tree)

tree_fit <- tree(spam ~ ., data = train)
summary(tree_fit)
```
```
Classification tree:
tree(formula = spam ~ ., data = train)
Variables actually used in tree construction:
[1] "X."          "remove"      "X..1"        "george"      "hp"
[6] "CapRunMax"   "our"         "free"        "CapRunTotal"
Number of terminal nodes:  14
Residual mean deviance:  0.4606 = 1583 / 3437
Misclassification error rate: 0.08056 = 278 / 3451
```
```r
# Base-R plotting (alternative):
# plot(tree_fit)
# text(tree_fit, pretty = 0)

# Nicer visualisation with rpart / rpart.plot
if (!require(rpart)) install.packages("rpart")
if (!require(rpart.plot)) install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

rpart_fit <- rpart(spam ~ ., data = train, method = "class")
rpart.plot(rpart_fit, main = "Decision Tree for Spam Classification")
```
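To compare the tree with the logistic classifier on the test data, one can obtain sharp class predictions from the fitted tree and compute the test accuracy; a sketch, assuming `tree_fit` and `test` from above:

```r
# Class predictions from the fitted tree on the held-out test set
tree_pred <- predict(tree_fit, newdata = test, type = "class")

# Test accuracy and misclassification rate
tree_accuracy <- mean(tree_pred == test$spam)
cat("Test accuracy (tree):", tree_accuracy,
    "\nTest misclassification rate:", 1 - tree_accuracy, "\n")
```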
8. k-nearest neighbour for classifying spam emails (1 mark)
💪 Problem 8.1
Without using any package, fit a k-nearest neighbour classifier to the training dataset from Problem 6. Choose the value of \(k\) that minimises some suitable error function for the test data.
Tip
Note that the RMSE is not a suitable error function because the labels are binary. The misclassification rate is an alternative you can use. To choose the optimal value of \(k\), you may compute the misclassification rate on the test data for several candidate values of \(k\) and choose the one that minimises this quantity.
To always ensure a majority vote, consider only odd values of \(k\).
```r
# Function to calculate the Euclidean distance
euclidean_distance <- function(x1, x2) {
  sqrt(sum((x1 - x2)^2))
}

# Function to classify a single test point using k-NN
knn_classify <- function(test_point, train_data, train_labels, k) {
  distances <- apply(train_data, 1,
                     function(train_point) euclidean_distance(test_point, train_point))
  neighbors <- order(distances)[1:k]
  neighbor_labels <- train_labels[neighbors]
  return(names(sort(table(neighbor_labels), decreasing = TRUE))[1])
}

# Function to perform k-NN classification on the whole test set
knn_predict <- function(test_data, train_data, train_labels, k) {
  apply(test_data, 1,
        function(test_point) knn_classify(test_point, train_data, train_labels, k))
}

load(file = '/Users/chenxinghe/Desktop/Hugo/ESILV/A5/S9_UTS/37401_ML/ComputerLab/1/spam_ham_emails.RData')

# Standardise the features
Spam_ham_emails[, -1] <- scale(Spam_ham_emails[, -1])
Spam_ham_emails['spam'] <- as.factor(Spam_ham_emails['spam'] == 1)  # Changing 1 -> TRUE, 0 -> FALSE
levels(Spam_ham_emails$spam) <- c("not spam", "spam")

# Split the data into training and test sets
set.seed(1234)
library(caret)
train_obs <- createDataPartition(y = Spam_ham_emails$spam, p = .75, list = FALSE)
train <- Spam_ham_emails[train_obs, ]
test <- Spam_ham_emails[-train_obs, ]

train_data <- as.matrix(train[, -1])
train_labels <- train$spam
test_data <- as.matrix(test[, -1])
test_labels <- test$spam

# Candidate values for k (odd values only, to guarantee a majority vote)
k_values <- seq(1, 9, by = 2)
misclassification_rates <- numeric(length(k_values))

# Compute the misclassification rate on the test data for each value of k
for (i in seq_along(k_values)) {
  k <- k_values[i]
  predictions <- knn_predict(test_data, train_data, train_labels, k)
  misclassification_rate <- mean(predictions != test_labels)
  misclassification_rates[i] <- misclassification_rate
  cat("k =", k, "Misclassification rate =", misclassification_rate, "\n")
}
```
```
k = 1 Misclassification rate = 0.1052174
k = 3 Misclassification rate = 0.09217391
k = 5 Misclassification rate = 0.0973913
k = 7 Misclassification rate = 0.0973913
k = 9 Misclassification rate = 0.1069565
```
How does the classifier in Problem 8.1 perform on the test dataset compared to the logistic classifier in Problem 6 and the decision tree classifier in Problem 7?
Tip
Since the k-nearest neighbour method only obtains a sharp prediction (i.e. 0 or 1 directly without predicting the probability first from which we can choose a threshold for classification), we cannot readily compute the ROC curve. Thus, obtain sharp predictions from Problems 6 and 7 first and use them for comparison to the k-nearest neighbour method.
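The `metrics_logistic`, `metrics_tree`, and `metrics_knn` objects used in the comparison below can be produced by a small helper that computes the four metrics from sharp predictions; a sketch (the helper name `classification_metrics` is illustrative, treating "spam" as the positive class):

```r
# Compute accuracy, precision, sensitivity and specificity
# from sharp class predictions (positive class: "spam")
classification_metrics <- function(pred, truth, positive = "spam") {
  TP <- sum(pred == positive & truth == positive)
  FP <- sum(pred == positive & truth != positive)
  TN <- sum(pred != positive & truth != positive)
  FN <- sum(pred != positive & truth == positive)
  list(accuracy    = (TP + TN) / length(truth),
       precision   = TP / (TP + FP),
       sensitivity = TP / (TP + FN),
       specificity = TN / (TN + FP))
}

# e.g. metrics_logistic <- classification_metrics(y_hat_test, test$spam)
```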
```r
# Create a data frame for the metrics
metrics_table <- data.frame(
  Metric = rep(c("Accuracy", "Precision", "Sensitivity (Recall)", "Specificity"), times = 3),
  Model = rep(c("Logistic Regression", "Decision Tree", "k-NN"), each = 4),
  Value = c(metrics_logistic$accuracy, metrics_logistic$precision,
            metrics_logistic$sensitivity, metrics_logistic$specificity,
            metrics_tree$accuracy, metrics_tree$precision,
            metrics_tree$sensitivity, metrics_tree$specificity,
            metrics_knn$accuracy, metrics_knn$precision,
            metrics_knn$sensitivity, metrics_knn$specificity)
)

# Print the table using knitr::kable for a nicely formatted output
if (!require(knitr)) install.packages("knitr")
```

```
Loading required package: knitr
```

```r
library(knitr)

# Print the metrics table
kable(metrics_table, caption = "Comparison of Classifier Metrics", digits = 3)
```
Comparison of Classifier Metrics

| Metric               | Model               | Value |
|----------------------|---------------------|------:|
| Accuracy             | Logistic Regression | 0.910 |
| Precision            | Logistic Regression | 0.909 |
| Sensitivity (Recall) | Logistic Regression | 0.857 |
| Specificity          | Logistic Regression | 0.944 |
| Accuracy             | Decision Tree       | 0.903 |
| Precision            | Decision Tree       | 0.896 |
| Sensitivity (Recall) | Decision Tree       | 0.854 |
| Specificity          | Decision Tree       | 0.935 |
| Accuracy             | k-NN                | 0.908 |
| Precision            | k-NN                | 0.876 |
| Sensitivity (Recall) | k-NN                | 0.892 |
| Specificity          | k-NN                | 0.918 |
```r
library(ggplot2)

# Bar plot comparing the metrics across the three models
ggplot(metrics_table, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme_minimal() +
  labs(title = "Comparison of Classifier Metrics", x = "Metric", y = "Value") +
  scale_fill_brewer(palette = "Set1") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Comparing the three classifiers, the logistic regression performs best on accuracy, precision, and specificity, while the k-NN model is slightly better on sensitivity.